Quality Red Wine Analysis by Brian J Hartman

Red Wine Data: This dataset contains 1,599 observations of 13 variables
describing the red variant of a Portuguese wine, Vinho Verde. I will use this
data to see which attributes are most common among “high” quality Vinho Verde
wines. To do so, I create three categories of wine based on their quality
rating: High, Medium, and Low.

Let’s take a look at the structure of the dataset after converting quality to
an ordered factor and adding the new variable “rating”, which groups quality
into three buckets. We will also look at the distribution of the new variable
with a histogram.
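
The bucketing step can be sketched in base R as follows. This is only a
minimal illustration, not necessarily the code used for this report: the
break points are an assumption inferred from the summary counts (Low = 63 =
10 + 53, Medium = 1319 = 681 + 638, High = 217 = 199 + 18), and
`quality_score` is a hypothetical vector holding one example score per level.

```r
# One example score for each quality level that occurs in the data (3 to 8).
quality_score <- 3:8

# Ordered factor version of the raw score.
quality <- factor(quality_score, levels = 3:8, ordered = TRUE)

# Assumed buckets: (2,4] -> Low, (4,6] -> Medium, (6,8] -> High.
rating <- cut(quality_score,
              breaks = c(2, 4, 6, 8),
              labels = c("Low", "Medium", "High"),
              ordered_result = TRUE)

print(data.frame(quality, rating))
```

Using `cut()` with `ordered_result = TRUE` keeps the Low < Medium < High
ordering, which matches the `Ord.factor` type reported for rating.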

## 'data.frame':    1599 obs. of  14 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : Ord.factor w/ 6 levels "3"<"4"<"5"<"6"<..: 3 3 3 4 3 3 3 5 5 3 ...
##  $ rating              : Ord.factor w/ 3 levels "Low"<"Medium"<..: 2 2 2 2 2 2 2 3 3 2 ...
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol      quality    rating    
##  Min.   : 8.40   3: 10   Low   :  63  
##  1st Qu.: 9.50   4: 53   Medium:1319  
##  Median :10.20   5:681   High  : 217  
##  Mean   :10.42   6:638                
##  3rd Qu.:11.10   7:199                
##  Max.   :14.90   8: 18

In the dataset summary we can see the addition of the new variable “rating” and
the associated counts at each level. We can also see the new summary of quality:
since it has been transformed to a factor, its counts are summarized as well.

The histogram shows the counts of each rating. The distribution is interesting:
there is an excessive number of medium rated wines, few highly rated wines, and
even fewer on the low end. At this point I would have to question the manual
quality assignment, the sample collection method, or both.

Univariate Plots Section

Distribution: From the distribution of the ratings above, we can see there
is a large number of wines in the Medium rating category, meaning they
have a quality value of 5 or 6. I am not sure why that is; it could be due to
the data collection method, as this may not be a random sample. To begin with, I
would like to look at the distribution of all the other variables, paying
particular attention to a few that, at first thought, I think would have an
impact on quality. These include: Alcohol, pH, Sulphates and Density.

Univariate Analysis

Data Structure: The dataset on red wines consists of 1599 rows
(observations) of data. Each observation contains 13 columns (variables) of
attributes of the data. The primary categorical variable of this dataset is
quality; the remaining variables are continuous in nature and describe the
chemical and physical properties of the wine. I noticed the following
distributions in the histograms:
- Normal: Quality, pH, Density, Volatile Acidity
- Positively Skewed: Alcohol, Citric Acid, Fixed Acidity, Free Sulfur
Dioxide, Total Sulfur Dioxide, Sulphates
- Long Tail: Chlorides, Residual Sugars

Transform the Data: I’d like to take another look at some of the data
that may be overdispersed. In particular I want to revisit the long tail data
for Chlorides and Residual Sugars, as well as the positively skewed Sulphates
data, using both log base 10 and square root transformations.

It appears both the square root and log transformations bring the data a
little closer to a normal distribution; however, there are still several
outliers.
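
To illustrate why these two transformations help with long right tails, here
is a small sketch on simulated data. The values are randomly generated
stand-ins, not the wine measurements, and `skewness()` is a hypothetical
helper defined here because base R does not provide one.

```r
set.seed(42)  # arbitrary seed so the sketch is repeatable

# Simulated right-skewed values standing in for a long-tailed variable.
x <- rlnorm(1000, meanlog = 0, sdlog = 1)

# Sample skewness: mean of the cubed standardized values.
skewness <- function(v) {
  z <- (v - mean(v)) / sd(v)
  mean(z^3)
}

skew <- c(raw   = skewness(x),
          sqrt  = skewness(sqrt(x)),
          log10 = skewness(log10(x)))
print(round(skew, 2))
```

A skewness near zero suggests a roughly symmetric shape. In this sketch the
square root reduces the skew and log10 reduces it further, which is the same
ordering I would expect for a strongly right-skewed variable.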

Bivariate Plots Section

Let’s take a look at the relationship between quality and the 4 variables I
expected to be the best predictors of a high quality wine: Alcohol, pH,
Sulphates, and Density.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90
## [1] 0.4761663

This plot was originally overplotted, so I added an alpha of 0.3 to get a
better view of the data points. We can see in this distribution there is a
positive correlation between alcohol content and quality rating.
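
Since quality is stored as an ordered factor, it has to be converted to a
number before a correlation can be computed. The sketch below uses only the
first ten rows shown in the structure output, so the resulting value will not
match the full-data figure; it is meant to show that `as.numeric()` on a
factor returns level codes (1 to 6) rather than the scores (3 to 8), and that
the Pearson correlation is the same either way.

```r
# First ten rows of alcohol and quality, copied from the structure output.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- factor(c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5),
                  levels = 3:8, ordered = TRUE)

# Level codes (1..6) versus the original scores (3..8).
codes  <- as.numeric(quality)
scores <- as.numeric(as.character(quality))

# A constant shift does not change the Pearson correlation.
r_codes  <- cor(alcohol, codes)
r_scores <- cor(alcohol, scores)
print(c(codes = r_codes, scores = r_scores))
```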

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010
## [1] -0.05773139

There is little visual evidence of a correlation between pH and quality. If
a correlation exists it appears to be a minor negative correlation.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0037
## [1] -0.1749192

Again, as with pH, we are seeing very little correlation here with quality, and
what we do see is a negative correlation, meaning higher quality wines are less
dense. This may make some sense given that higher quality wines tend to have
a higher alcohol concentration.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.4815 -0.2596 -0.2076 -0.1934 -0.1367  0.3010
## [1] 0.3086419

As with alcohol, we are seeing a positive correlation between sulphates and
quality (the summary and correlation above are on the log10 scale). There seem
to be several outliers in the quality ranges of 5 and 6, but overall the
correlation looks fairly strong.

From the above scatterplots we can see there appears to be a noticeable
correlation between the percent alcohol by volume and quality, and somewhat of a
correlation between sulphates and quality. pH and Density seem to have little
to no correlation with the quality of the wine.

Given this information I will create a correlation table between all of the
variables to see what other correlations may exist, in order to determine what
further analysis should be performed.
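
A correlation table of this kind can be sketched with base R’s `cor()`. The
example below uses only four variables and the first ten rows from the
structure output, so its values differ from the full table. The 0.3 cut-off
for flagging pairs is an assumption on my part, chosen because it appears to
match the bolded entries.

```r
# First ten rows of four variables, copied from the structure output.
wine_head <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  citric.acid   = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH            = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  alcohol       = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
)

# Pairwise Pearson correlations, rounded like the table output.
corr <- round(cor(wine_head), 4)
print(corr)

# Flag each pair once (upper triangle) when |r| exceeds the assumed cut-off.
strong <- abs(corr) > 0.3 & upper.tri(corr)
print(which(strong, arr.ind = TRUE))
```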

## 
## ---------------------------------------------------------------------------
##           &nbsp;            fixed.acidity   volatile.acidity   citric.acid 
## -------------------------- --------------- ------------------ -------------
##     **fixed.acidity**             1             -0.2561        **0.6717**  
## 
##    **volatile.acidity**        -0.2561             1           **-0.5525** 
## 
##      **citric.acid**         **0.6717**       **-0.5525**           1      
## 
##     **residual.sugar**         0.1148           0.001918         0.1436    
## 
##       **chlorides**            0.09371           0.0613          0.2038    
## 
##  **free.sulfur.dioxide**       -0.1538          -0.0105         -0.06098   
## 
##  **total.sulfur.dioxide**      -0.1132          0.07647          0.03553   
## 
##        **density**            **0.668**         0.02203        **0.3649**  
## 
##           **pH**             **-0.683**          0.2349        **-0.5419** 
## 
##       **sulphates**             0.183            -0.261        **0.3128**  
## 
##        **alcohol**            -0.06167          -0.2023          0.1099    
## 
##        **quality**             0.1241         **-0.3906**        0.2264    
## ---------------------------------------------------------------------------
## 
## Table: Table continues below
## 
##  
## ------------------------------------------------------------------------------
##           &nbsp;            residual.sugar   chlorides    free.sulfur.dioxide 
## -------------------------- ---------------- ------------ ---------------------
##     **fixed.acidity**           0.1148        0.09371           -0.1538       
## 
##    **volatile.acidity**        0.001918        0.0613           -0.0105       
## 
##      **citric.acid**            0.1436         0.2038          -0.06098       
## 
##     **residual.sugar**            1           0.05561            0.187        
## 
##       **chlorides**            0.05561           1             0.005562       
## 
##  **free.sulfur.dioxide**        0.187         0.005562             1          
## 
##  **total.sulfur.dioxide**       0.203          0.0474         **0.6677**      
## 
##        **density**            **0.3553**       0.2006          -0.02195       
## 
##           **pH**               -0.08565        -0.265           0.07038       
## 
##       **sulphates**            0.005527      **0.3713**         0.05166       
## 
##        **alcohol**             0.04208        -0.2211          -0.06941       
## 
##        **quality**             0.01373        -0.1289          -0.05066       
## ------------------------------------------------------------------------------
## 
## Table: Table continues below
## 
##  
## -----------------------------------------------------------------------------
##           &nbsp;            total.sulfur.dioxide     density         pH      
## -------------------------- ---------------------- ------------- -------------
##     **fixed.acidity**             -0.1132           **0.668**    **-0.683**  
## 
##    **volatile.acidity**           0.07647            0.02203       0.2349    
## 
##      **citric.acid**              0.03553          **0.3649**    **-0.5419** 
## 
##     **residual.sugar**             0.203           **0.3553**     -0.08565   
## 
##       **chlorides**                0.0474            0.2006        -0.265    
## 
##  **free.sulfur.dioxide**         **0.6677**         -0.02195       0.07038   
## 
##  **total.sulfur.dioxide**            1               0.07127      -0.06649   
## 
##        **density**                0.07127               1        **-0.3417** 
## 
##           **pH**                  -0.06649         **-0.3417**        1      
## 
##       **sulphates**               0.04295            0.1485        -0.1966   
## 
##        **alcohol**                -0.2057          **-0.4962**     0.2056    
## 
##        **quality**                -0.1851            -0.1749      -0.05773   
## -----------------------------------------------------------------------------
## 
## Table: Table continues below
## 
##  
## -------------------------------------------------------------------
##           &nbsp;            sulphates      alcohol       quality   
## -------------------------- ------------ ------------- -------------
##     **fixed.acidity**         0.183       -0.06167       0.1241    
## 
##    **volatile.acidity**       -0.261       -0.2023     **-0.3906** 
## 
##      **citric.acid**        **0.3128**     0.1099        0.2264    
## 
##     **residual.sugar**       0.005527      0.04208       0.01373   
## 
##       **chlorides**         **0.3713**     -0.2211       -0.1289   
## 
##  **free.sulfur.dioxide**     0.05166      -0.06941      -0.05066   
## 
##  **total.sulfur.dioxide**    0.04295       -0.2057       -0.1851   
## 
##        **density**            0.1485     **-0.4962**     -0.1749   
## 
##           **pH**             -0.1966       0.2056       -0.05773   
## 
##       **sulphates**             1          0.09359       0.2514    
## 
##        **alcohol**           0.09359          1        **0.4762**  
## 
##        **quality**            0.2514     **0.4762**         1      
## -------------------------------------------------------------------

According to the Correlation Table, in addition to alcohol and sulphates,
citric.acid and volatile.acidity have the highest correlations with quality of
wine. We can take a look at these correlations in visual form below:

##   3   4   5   6   7   8 
##  10  53 681 638 199  18
## [1] 0.2263725

According to our correlation table, quality has a correlation value of 0.2264
with citric acid. This is a modest positive correlation and can be seen in the
above plot. According to the provided dataset notes, citric acid adds a taste
of freshness to the wine, so it makes sense to see higher rated wines having
higher levels of citric acid.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800
## [1] -0.3905578

Interestingly enough, we see the opposite of citric acid in volatile acidity,
which, again according to the provided notes, has the effect of making a wine
taste unpleasant. So here the correlation with quality is clearly negative,
meaning higher rated wines have lower concentrations of volatile acids.

Below are some additional variable pairs with significant levels of
correlation to each other.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0037
## [1] -0.4961798

I mentioned earlier, on the plot for density and quality, that we may see a
correlation between density and alcohol. While visually density did not show
much of an impact on quality, it does in fact have a negative correlation with
it at -0.1749. And as you can see above, density and alcohol have a negative
correlation of -0.4962.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0037
## [1] 0.6680473

Density and fixed acidity have one of the strongest correlations in the dataset
at 0.668.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0037
## [1] 0.3552834

Density and residual sugar are correlated at 0.3553. If we were going to
investigate this further it may make sense to get a better visual by removing
the outliers and decreasing the breaks on the x axis.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90
## [1] 0.6717034

Fixed acidity and citric acid are correlated at 0.6717. This probably should
not be a surprise as an increase in citric acid would naturally add to the
acidity of the wine.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90
## [1] -0.6829782

pH and fixed acidity are the most highly correlated pair of variables in the
dataset at -0.683. Again, I’m not sure that this adds much to the analysis, as
one would expect to see pH decrease at higher levels of acid.

Multivariate Plots Section

We know from our initial observations that alcohol and volatile acidity
contribute significantly to the quality of a red wine. Let’s look at combining
a few of these variables to see what that looks like.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

We can see from the above plot that higher quality wines typically have a
higher concentration of alcohol while at the same time having a lower level of
volatile acidity.

We can now add sulphates to the plot to see its impact on the quality of a
red wine.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

Again we see higher quality wines typically have higher alcohol content, lower
volatile acidity, and are higher in sulphates (hue).

Additionally, we can look at the impact of citric acid and sulphates on the
wine rating; see below.

As shown above, we typically see wines with higher ratings consisting of
higher levels of both citric acid and sulphates. We will utilize these
variables along with a few other variables with high correlation to create a
linear model to see if we can predict what makes up a high quality wine below.

Linear Multivariate Model

## 
## Calls:
## m1: lm(formula = as.numeric(quality) ~ alcohol, data = wine)
## m2: lm(formula = as.numeric(quality) ~ alcohol + pH, data = wine)
## m3: lm(formula = as.numeric(quality) ~ alcohol + pH + sulphates, 
##     data = wine)
## m4: lm(formula = as.numeric(quality) ~ alcohol + pH + sulphates + 
##     density, data = wine)
## m5: lm(formula = as.numeric(quality) ~ alcohol + pH + sulphates + 
##     density + citric.acid, data = wine)
## m6: lm(formula = as.numeric(quality) ~ alcohol + pH + sulphates + 
##     density + citric.acid + volatile.acidity, data = wine)
## 
## ========================================================================================================
##                          m1            m2            m3            m4            m5            m6       
## --------------------------------------------------------------------------------------------------------
##   (Intercept)          -0.125         2.426***      1.345***      3.459        18.500       -13.771     
##                        (0.175)       (0.387)       (0.401)      (11.213)      (12.040)      (11.922)    
##   alcohol               0.361***      0.386***      0.367***      0.365***      0.337***      0.342***  
##                        (0.017)       (0.017)       (0.017)       (0.019)       (0.021)       (0.020)    
##   pH                                 -0.850***     -0.635***     -0.641***     -0.402**      -0.478***  
##                                      (0.116)       (0.116)       (0.120)       (0.139)       (0.134)    
##   sulphates                                         0.868***      0.872***      0.811***      0.656***  
##                                                    (0.104)       (0.106)       (0.107)       (0.104)    
##   density                                                        -2.088       -17.745        15.845     
##                                                                 (11.061)      (11.971)      (11.885)    
##   citric.acid                                                                   0.405***     -0.378**   
##                                                                                (0.121)       (0.135)    
##   volatile.acidity                                                                           -1.322***  
##                                                                                              (0.116)    
## --------------------------------------------------------------------------------------------------------
##   R-squared             0.227         0.252         0.283         0.283         0.288         0.342     
##   adj. R-squared        0.226         0.251         0.282         0.282         0.286         0.340     
##   sigma                 0.710         0.699         0.684         0.685         0.682         0.656     
##   F                   468.267       268.888       210.183       157.551       129.110       137.956     
##   p                     0.000         0.000         0.000         0.000         0.000         0.000     
##   Log-likelihood    -1721.057     -1694.466     -1660.297     -1660.279     -1654.636     -1591.909     
##   Deviance            805.870       779.508       746.896       746.879       741.626       685.664     
##   AIC                3448.114      3396.931      3330.594      3332.558      3323.272      3199.818     
##   BIC                3464.245      3418.440      3357.480      3364.821      3360.912      3242.835     
##   N                  1599          1599          1599          1599          1599          1599         
## ========================================================================================================

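The linear multivariate model above (m6) can explain approximately 34% of the
variance in wine quality, using the following regression formula obtained from
the model:

WineQuality = -13.771 + 0.342(alcohol) - 0.478(pH) + 0.656(sulphates)
+ 15.845(density) - 0.378(citric.acid) - 1.322(volatile.acidity)

Note that the models were fit on as.numeric(quality). Because quality is an
ordered factor with levels 3 to 8, this returns the level codes 1 to 6, so the
formula predicts quality minus 2. This is consistent with m1, whose fitted
value at the mean alcohol of 10.42 is about 3.64, while the mean quality score
is about 5.64. Add 2 to express a prediction on the original scale.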
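
As a worked example, the m6 coefficients from the model table can be applied
to the first wine shown in the structure output. This is only a rough sketch:
the coefficients are rounded to three decimals and the density value is the
rounded display value, so the result is approximate. The `+ 2` step reflects
that `as.numeric()` on the ordered quality factor yields level codes 1 to 6
rather than scores 3 to 8.

```r
# m6 coefficients as printed in the model table (rounded).
coef_m6 <- c(intercept = -13.771, alcohol = 0.342, pH = -0.478,
             sulphates = 0.656, density = 15.845,
             citric.acid = -0.378, volatile.acidity = -1.322)

# First wine in the structure output; intercept term set to 1.
wine1 <- c(intercept = 1, alcohol = 9.4, pH = 3.51, sulphates = 0.56,
           density = 0.998, citric.acid = 0, volatile.acidity = 0.7)

# Linear predictor on the level-code scale, then shifted to the 3..8 scale.
pred_code  <- sum(coef_m6 * wine1[names(coef_m6)])
pred_score <- pred_code + 2
print(round(c(code = pred_code, score = pred_score), 2))
```

The shifted prediction comes out close to 5, which is the quality recorded for
that wine.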


Final Plots and Summary

Plot One

## 
##     3     4     5     6     7     8 
##  0.63  3.31 42.59 39.90 12.45  1.13

Description One

This was the first plot I created, and one of the most surprising things to me
was the number of wines with a quality number of 5 or 6. Combined they make up
over 80% of the occurrences in the dataset. With such a high concentration of
occurrences in these two buckets, I assumed it would be difficult to predict
the qualities of a good wine.
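
The percentages above, and the combined share of the 5 and 6 levels, follow
directly from the quality counts in the dataset summary. A short sketch of the
arithmetic:

```r
# Counts per quality level, taken from the dataset summary.
counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

# Percentage of all 1599 wines at each level.
pct <- round(100 * counts / sum(counts), 2)
print(pct)

# Combined share of quality 5 and 6.
share_5_6 <- 100 * sum(counts[c("5", "6")]) / sum(counts)
print(round(share_5_6, 2))
```

The combined share works out to roughly 82.5% (1319 of 1599 wines).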

Plot Two

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

Description Two

This plot was interesting because I just happened to stumble on it. I was
trying to see how many variables I could fit on one plot while still keeping it
readable. I was also looking to use the variable rating that I created, which
groups the quality ratings into three groups. I think this plot does a lot
without being too busy. It shows how percent alcohol, sulphates, and volatile
acidity combine in higher rated qualities of wine.

Plot Three

Description Three

I took this plot a few steps further than the one I originally created at the
beginning of the analysis. Alcohol has the strongest correlation with quality
of all the variables. I added box plots to the distributions along with a
summary of the mean for each quality level, which emphasize the strong
correlation between alcohol and quality.
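
The per-level means can be computed with `tapply()`. The sketch below uses
only the first ten rows shown in the structure output, so it covers just three
quality levels and is not representative of the full dataset; it is meant to
show the mechanics rather than reproduce the plot.

```r
# First ten rows of alcohol and quality, copied from the structure output.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- factor(c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5),
                  levels = 3:8, ordered = TRUE)

# Mean alcohol per quality level; droplevels() removes levels with no rows.
mean_by_quality <- tapply(alcohol, droplevels(quality), mean)
print(round(mean_by_quality, 2))
```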


Reflection

First I would like to address future work that could be done with this
dataset. I am sure there are many other things that could be done with this
data, but a more thorough analysis would need a more comprehensive dataset.
More observations at both ends of the spectrum, for low quality and high
quality wines, could help solidify any models created with the data.

Even given the perceived lack of data, there was enough evidence to suggest a
strong correlation between quality and alcohol as well as between quality and
sulphates, citric.acid, and volatile.acidity. The majority of my analysis
focused on these variables. I was able to come up with a model that can account
for 34% of the variance associated with the quality rating of this particular
red wine. My assumption is that there is a lot more that goes into producing
quality red wines than what I have exposed here. I would also be willing to bet
that certain geographical and ecological variables play a huge part in wine
making, not to mention the human element of grading wines.

It would be interesting to see a different dataset that takes on many other
variables than what was presented here.


References

1. https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityInfo.txt
(dataset description)

2. http://rmarkdown.rstudio.com/authoring_basics.html
(R Markdown documentation)

3. https://www.rdocumentation.org/packages/dplyr/versions/0.7.3/topics/select
(dplyr - select function to remove columns)

4. https://www.rdocumentation.org/packages/dplyr/versions/0.5.0/topics/mutate
(dplyr - transform quality back to a number)

5. http://rapporter.github.io/pander/
(install pander to output the correlation table)

6. https://www.rdocumentation.org/packages/memisc/versions/0.99.14.3/topics/mtable
(sdigits argument for mtable)

7. https://stackoverflow.com/questions/40675778/center-plot-title-in-ggplot2
(center plot title)